Overview of Data
The purpose of this dataset was to help find predictors of diabetes.
The goal is to help medical professionals identify patients with
potential risk factors of diabetes.The dataset contains information on
patient demographics such as age and gender, as well as medical
information including blood glucose levels and hypertension. The
observations in this dataset were obtained from research studies and
healthcare institutions. The dataset was obtained via Kaggle.
Mustafa, T. Z. (2021). Diabetes Prediction Dataset. Kaggle.
https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset
The dataset consists of 8 predictor variables for diabetes: ‘gender’
the patient’s gender as male or female. ‘age’ the age of the patient in
years. ‘hypertension’ yes or no, on whether the patient has
hypertension. ‘heart disease’ yes or no, on whether the patient has
heart disease. ‘smoking history’ the patient’s smoking history indicated
as never, no info, current, former or not current. ‘bmi’ the patient’s
body mass index (kg/m**2). ‘HbA1c level’ the patient’s average blood
sugar level over the past 2-3 months (%). ‘blood glucose level’ amount
of blood sugar in the patient’s blood (mg/dL).
#load in dataset
data <-read.csv('https://raw.githubusercontent.com/GabbyK8/Datasets/refs/heads/main/diabetes_prediction_dataset.csv')
diabetesd<-data
#change heart disease, diabetes and hypertension to 1 and 0 for easier use in MICE.
data$diabetes <- recode(data$diabetes, "1"="Yes", "0"="No")
data$hypertension <- recode(data$hypertension, "1"="Yes", "0"="No")
data$heart_disease <- recode(data$heart_disease, "1"="Yes", "0"="No")
#plot interaction between diabetes and hypertension.
ggplot(data, aes(x = hypertension, fill = diabetes)) +
geom_bar(position = "dodge") + # Use "stack" for a stacked bar chart
labs (title="Risk of Diabetes by Hypertension", x="Hypertension", y= "Frequency",
fill= "Diabetes")
Comparison of Two Categorical Variables
The purpose of this graph is to show the relationship between having
hypertension and diabetes. For this graph, I changed the values of 1 in
both the diabetes and hypertension variable to “Yes” and the value 0 to
“No.” These integers were meant to represent categorical variables with
1 correlating to being positive for hypertension or diabetes and 0
correlating to being negative to hypertension or diabetes. Based on our
graph, we can see that there is no correlation between having diabetes
and having hypertension, with most patients within our sample having
neither diabetes or hypertension. Thus, hypertension would not be a good
predictor of diabetes within patients.
interact = ggplot(data, aes(x =blood_glucose_level , y = HbA1c_level, group = diabetes, color = diabetes)) +
geom_line(size = 1) + # Lines representing interaction
geom_point(size = 2) + # Points for data
labs(
title = "Interaction of Blood Glucose and HbA1c Levels",
x = "Blood Glucose Level",
y = "HbA1c Level",
color = "Presence of Diabetes"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
axis.text = element_text(size = 12),
legend.position = "top"
)
interact
Comparison of Two Numerical Variables
The two variables being compared here are Blood Glucose Level and
HbA1c Level. HbA1cc is a shortened term for hemoglobin A1c. Hb1Ac level
is a measure of average blood sugar levels over 2-3 months and is
measured in %. Blood glucose level is the amount of blood sugar in a
patients blood at a given time. Our plot shows a correlation between
high levels in HbA1c and blood glucose levels and it’s relationship with
diabetes. This plot suggests patients with high levels of HbA1c and
blood glucose levels are at high risk for diabetes.
#plot of a numerical and categorical variable
boxplot(data$age ~ data$diabetes,
xlab="Patient Age",
ylab="Presence of Diabetes",
col = c("skyblue", "purple"),
main="Visualization of Diabetes by Age",
cex.main = 1.1,
col.main = "navy")
Comparison of a Numerical and Categorical Variables
This chart shows the relationship between age and diabetes diagnosis.
This chart shows that as patients grow older they are at higher risk for
diabetes. This would make age a predictor for diabetes, with the average
age for those without diabetes being 40 years and 60 years for those
with diabetes. However, we can see that we do have a few outliers for
those with diabetes with some patients being diagnosed at 20 or younger.
This could be from the population of people with Type I diabetes which
is usually diagnosed in patients under 20. Our dataset does not
distinguish between type I and type II diabetes.
Conclusion for Comparison of Variables
Diabetes can be predicted using age, blood glucose levels, and HbA1c
levels in patients. However, our results showed hypertension is not a
good predictor of diabetes. Based on our findings, it would be
interesting to look at factors that occur once a patient becomes older
that could lead to diabetes at an older age. As well, diet could be
another factor to look into, such as grams of sugar consumed on a daily
or weekly basis, to see if sugar consumption can predict diabetes.
Another important thing to analyze in future studies would be Type I
versus Type II diabetes, as it could be the reason for some of the
outliers in the boxplot.
Overview of Missing Variables
An important step in data analysis for machine learning is handling
missing values. Missing values in a dataset can occur for many reasons,
but not limited to non-response in surveys, participants leaving the
study, data entry errors and system limitations. If missing values are
not addressed, models can become biased and accuracy of the anaylysis
can decrease. This section examines different imputation strategies to
effectively handle missing values.
The imputation methods we will examine include:
Replacement Imputation for Categorical Features: Mode imputation and
KNN-based imputation.
Regression-based Imputation for Numerical Features: Predictive
modeling to estimate missing values.
Multiple Imputation: Advanced methods such as MICE to improve
robustness.
# create random observation ID and replace the corresponding obs with missing
ltr.missing.id <- sample(1:1000, 50, replace = FALSE)
gpa.missing.id <- sample(1:1000, 15, replace = FALSE)
diabetesd$bmi[ltr.missing.id] <- NA
diabetesd$heart_disease[gpa.missing.id] <- NA
diabetesd$smoking_history[diabetesd$smoking_history == "No Info"] <- NA
Mode Imputation for Categorical Variables
In this method, missing variables are replaced with the most frequent
category of the corresponding variable. Mode imputation ensures
consistency and works well when missing values are randomly distributed.
Below, mode imputation is used on the variable, heart disease, which is
1 if the patient has heart disease and 0 if the patient does not have
heart disease.The variable smoking history had current, former, no info,
never, and ever as possible values. No Info indicates that we do not
have information on the participant’s smoking history and will be
treated as a missing values.
diabetesd <-diabetesd
# Function to impute mode
Mode <- function(x) {
ux <- unique(na.omit(x)) # Remove NAs before computing mode
tab <- tabulate(match(x, ux))
mode_value <- ux[which.max(tab)]
# Ensure mode_value is of the same type as x
if (is.factor(x)) {
return(factor(mode_value, levels = levels(x)))
} else if (is.character(x)) {
return(as.character(mode_value))
} else {
return(as.numeric(mode_value))
}
}
#apply mode imputation
value_imputed <- data.frame(
original = diabetesd$heart_disease,
original2=diabetesd$smoking_history,
imputed_modehd = replace(diabetesd$heart_disease, is.na(diabetesd$heart_disease), Mode((diabetesd$heart_disease))),
imputed_modesh = replace(diabetesd$smoking_history, is.na(diabetesd$smoking_history), Mode(diabetesd$smoking_history)))
summary(value_imputed)
original original2 imputed_modehd imputed_modesh
Min. :0.00000 Length:100000 Min. :0.00000 Length:100000
1st Qu.:0.00000 Class :character 1st Qu.:0.00000 Class :character
Median :0.00000 Mode :character Median :0.00000 Mode :character
Mean :0.03943 Mean :0.03942
3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :1.00000 Max. :1.00000
NA's :15
#plot the comparison of original data and mode imputed data
h1 <- ggplot(value_imputed, aes(x = original)) +
geom_histogram(bins=10, fill = "#ad1538", color = "#000000", position = "identity") +
ggtitle("Original for Heart Disease") +
theme_classic()
h2 <- ggplot(value_imputed, aes(x = original2)) +
geom_bar( fill = "#15ad4f", color = "#000000", position = "identity") +
ggtitle("Original for Smoking History") +
theme_classic()
h3 <- ggplot(value_imputed, aes(x = imputed_modehd)) +
geom_histogram(bins=10, fill = "#1543ad", color = "#000000", position = "identity") +
ggtitle("Mode-imputed for Heart Disease") +
theme_classic()
h4 <- ggplot(value_imputed, aes(x = imputed_modesh)) +
geom_bar( fill = "#ad8415", color = "#000000", position = "identity") +
ggtitle("Mode-imputed for Smoking History") +
theme_classic()
plot_grid(h1, h2, h3, h4, nrow = 2, ncol = 2)
Regression-Based Imputation for Numerical Variables
Using regression models, we can estimate missing values. For the
variable, bmi, we can predict the missing values using estimates from
our model. Bmi is a continuous variable, unlike the previous example
where our values could only be 0 or 1. In this method, a linear model
was created to help create estimates of our values.
#create a dataset for regression impute
regimpute<-diabetesd
# Identify rows where BMI is missing
missing_rows <- which(is.na(diabetesd$bmi)) # Get row indices
# Train a linear model using complete cases
lm_model <- lm(bmi ~ age + blood_glucose_level + HbA1c_level + heart_disease + hypertension + diabetes,
data = regimpute, na.action = na.exclude)
# Impute missing BMI values using the model
diabetesd$bmi[missing_rows] <- predict(lm_model, newdata = regimpute[missing_rows, ])
dep.mi <- bind_rows(
data.frame(value = diabetesd$bmi, imputed = regimpute$bmi))
#imp = "dep.imp1")
i7<-boxplot(diabetesd$bmi,
main = "Mean bmi in Original Data",
xlab = "BMI",
col = "blue",
border = "black",
horizontal = TRUE,
notch = TRUE
)
i8 <-boxplot(regimpute$bmi,
main = "Regression Imputation for BMI",
xlab = "BMI",
col = "red",
border = "black",
horizontal = TRUE,
notch = TRUE
)
MICE Imputation
MICE is multiple imputation techniques used to impute missing values.
This method also uses regression models and accounts for uncertainty by
generating multiple possible values. Using MICE, the variables smoking
history, heart disease, and bmi, can be imputated within one function.
MICE can handle different types of variables, imputates based on the
relationship between variables and it accounts for variability in
missing data.
#reload dataset so that it is free of mode and regression imputation done previously
data2 <- read.csv('https://raw.githubusercontent.com/GabbyK8/Datasets/refs/heads/main/diabetes_prediction_dataset.csv')
#create missing values and make smoking history a binary variable
ltr.missing.id <- sample(1:1000, 50, replace = FALSE)
gpa.missing.id <- sample(1:1000, 15, replace = FALSE)
data2$smoking_history[data2$smoking_history== "No Info"] <-NA
data2$bmi[ltr.missing.id] <- NA
data2$heart_disease[gpa.missing.id] <- NA
#make variables numeric
data2$gender <- as.numeric(as.factor(data2$gender))
data2$heart_disease<- as.factor(as.numeric(data2$heart_disease))
data2$smoking_history<-as.factor(data2$smoking_history)
#apply MICE imputation
df_mice <- mice(data2, method = c("","","","pmm","polyreg","pmm","","",""), m = 20, maxit=20,seed = 123, print=F)
# Complete dataset with imputed values
df_imputed <- complete(df_mice)
#plot comparisons of before and after MICE imputation for heart disease
g1 <- ggplot(value_imputed, aes(x = imputed_modehd)) +
geom_bar(fill = "#1543ad", color = "#000000", position = "identity") +
ggtitle("Mode-imputed") +
labs(x="Heart Disease")+
theme_classic()
g2 <- ggplot(value_imputed, aes(x = original)) +
geom_bar(fill = "#ad1538", color = "#000000", position = "identity") +
ggtitle("Original") +
labs(x="Heart Disease")+
theme_classic()
g3 <- ggplot(df_imputed, aes(x = heart_disease)) +
geom_bar(fill = "purple", color = "#000000", position = "identity") +
ggtitle("MICE") +
labs(x="Heart Disease")+
theme_classic()
plot_grid(g1, g2, g3, nrow=1, ncol =3)
#plot comparisons for smoking history
g4 <- ggplot(df_imputed, aes(x = smoking_history)) +
geom_bar( fill = "purple", color = "#000000", position = "identity") +
ggtitle("MICE") +
labs(x="Smoking History")+
theme_classic()
g5 <- ggplot(value_imputed, aes(x = original2)) +
geom_bar( fill = "red", color = "#000000", position = "identity") +
ggtitle("Original Smoking History") +
labs(x="Smoking History")+
theme_classic()
g6 <- ggplot(value_imputed, aes(x = imputed_modesh)) +
geom_bar( fill = "blue", color = "#000000", position = "identity") +
ggtitle("Mode-imputed") +
labs(x="Smoking History")+
theme_classic()
plot_grid(g4,g5,g6, nrow=3, ncol=1)
#plot comparisons for bmi
g7<-boxplot(diabetesd$bmi,
main = "Mean bmi in Original Data",
xlab = "BMI",
col = "blue",
border = "black",
horizontal = TRUE,
notch = TRUE
)
g8 <-boxplot(regimpute$bmi,
main = "Regression Imputation for BMI",
xlab = "BMI",
col = "red",
border = "black",
horizontal = TRUE,
notch = TRUE
)
g9 <-boxplot(df_imputed$bmi,
main = "MICE for BMI",
xlab = "BMI",
col = "purple",
border = "black",
horizontal = TRUE,
notch = TRUE
)
model5 <- with(df_mice, lm(diabetes ~ bmi+ heart_disease+ smoking_history ))
summary(pool(model5))
term estimate std.error statistic df
1 (Intercept) -1.588938e-01 0.0042264348 -37.59524385 8765.6364
2 bmi 8.387455e-03 0.0001291007 64.96833640 95903.5486
3 heart_disease1 2.213973e-01 0.0044025602 50.28829659 97099.1099
4 smoking_historyever 4.374612e-03 0.0044673218 0.97924717 768.3395
5 smoking_historyformer 3.923135e-02 0.0034242201 11.45701814 1528.3749
6 smoking_historynever -3.066871e-05 0.0026838513 -0.01142713 847.0962
7 smoking_historynot current 6.032198e-03 0.0038465169 1.56822334 577.2461
p.value
1 6.484235e-287
2 0.000000e+00
3 0.000000e+00
4 3.277661e-01
5 3.251856e-29
6 9.908854e-01
7 1.173770e-01
Skewness
Skewness is a measure of symmetry within a distribution. The skewness
measurements below are of the variables age, bmi, HbA1c level, and blood
glucose level, respectively. When skewness is equal to 0, we can assume
the data is normally distributed. When skewness is greater than 0, our
data is positively skewed, while a value less than zero indicates
negatively skewed data. For the diabetes dataset, the variables bmi and
blood glucose level are 1 or about 1, which indicate a significant right
skew in the distribution.
age1<- skewness(df_imputed$age)
bmi1<-skewness(df_imputed$bmi)
HbA1c1<-skewness( df_imputed$HbA1c_level)
glucose1<-skewness(df_imputed$blood_glucose_level)
mytable<-data.frame(age1,bmi1,HbA1c1, glucose1)
kable(mytable, caption= "Skewness of Numeric Variables")
Skewness of Numeric Variables
| -0.0519782 |
1.041976 |
-0.0668528 |
0.8216426 |
Analyzing Data
The next step is to find which variables are significantly important
when trying to predict diabetes. By using feature selection, we are able
to predict possible best fit models for our variable, diabetes. For this
data, we will focus on logistic regression. The variables of interest,
diabetes, is a binary variable indicating linear regression would not
make an ideal model.
Random Forest First, we will look at the best model
selected by feature selection through random forest:
# Prepare the data
df_trans$diabetes <- as.factor(df_trans$diabetes)
# Set up RFE control with random forest
control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
# Apply RFE to identify top features
results <- rfe(
x = df_trans[, !colnames(df_trans) %in% "diabetes"],
y = df_trans$diabetes,
sizes = c(1:9),
rfeControl = control
)
# Print the selected features
models<-data.frame(predictors(results))
kable(models, caption="Random Forest Selected Model")
Random Forest Selected Model
| HbA1c_level |
| blood_glucose_level |
| bmi |
| heart_disease |
| age |
| hypertension |
| gender |
| smoking_history |
Stepwise Logistic Regression Next, let’s look and
stepwise logistic regression
#Stepwise Logistic Regression
# Fit a full logistic regression model with all predictors
full_model <- glm(diabetes ~ ., data = df_trans, family = binomial)
# Perform stepwise selection (default is backward elimination)
step_model1 <- step(full_model, direction = "both")
Start: AIC=23047.57
diabetes ~ gender + age + hypertension + heart_disease + smoking_history +
bmi + HbA1c_level + blood_glucose_level
Df Deviance AIC
- smoking_history 4 23029 23045
<none> 23024 23048
- gender 1 23078 23100
- heart_disease 1 23176 23198
- hypertension 1 23308 23330
- bmi 1 24361 24383
- age 1 25096 25118
- blood_glucose_level 1 30572 30594
- HbA1c_level 1 34856 34878
Step: AIC=23044.52
diabetes ~ gender + age + hypertension + heart_disease + bmi +
HbA1c_level + blood_glucose_level
Df Deviance AIC
<none> 23029 23045
+ smoking_history 4 23024 23048
- gender 1 23087 23101
- heart_disease 1 23184 23198
- hypertension 1 23314 23328
- bmi 1 24369 24383
- age 1 25193 25207
- blood_glucose_level 1 30582 30596
- HbA1c_level 1 34865 34879
summary(step_model1)
Call:
glm(formula = diabetes ~ gender + age + hypertension + heart_disease +
bmi + HbA1c_level + blood_glucose_level, family = binomial,
data = df_trans)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.99142 0.11057 -90.360 < 2e-16 ***
gender 0.27068 0.03536 7.654 1.94e-14 ***
age 1.03388 0.02428 42.583 < 2e-16 ***
hypertension 0.80228 0.04665 17.198 < 2e-16 ***
heart_disease1 0.76240 0.06014 12.677 < 2e-16 ***
bmi 10.23780 0.28095 36.439 < 2e-16 ***
HbA1c_level 2.49236 0.03769 66.122 < 2e-16 ***
blood_glucose_level 10.29954 0.14902 69.113 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 58163 on 99999 degrees of freedom
Residual deviance: 23029 on 99992 degrees of freedom
AIC: 23045
Number of Fisher Scoring iterations: 8
Both selection types selected the same model! However I will still go
through the steps of cross-validation as though I was comparing the two
models.
Cross-Validation
Let’s explore cross-validation to see which model is best fit. For
logistic regression, it is meaningful to look at accuracy and kappa.
#add classification
df_trans$diabetes<- factor(ifelse(df_trans$diabetes==1, "Yes","No"))
train_control <- trainControl(
method = "cv", # cross-validation
number = 10, # 10-fold
classProbs = TRUE # use if you're tracking AUC, Sensitivity, etc.
)
#run CV model
cv_model <- train(
diabetes ~ hypertension + gender +age + HbA1c_level + blood_glucose_level + heart_disease + bmi,
data=df_trans,
method = "glm",
family = "binomial",
trControl = train_control
)
print(cv_model)
Generalized Linear Model
1e+05 samples
7e+00 predictors
2e+00 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 90000, 90000, 90000, 90000, 90000, 90000, ...
Resampling results:
Accuracy Kappa
0.95935 0.7015624
Accuracy shows the overall correctness of the model predictions and
kappa measures the levels of agreement between variables. Values closer
to 1 suggest higher accuracy and better agreement. This model suggests
high accuracy and high agreement based on the values.
ROC Curve and AOC Curve
Confusion Matrices ROC is also known as Receiver
Operating Characteristic Analysis. It is a technique using graphs that
evaluates the performance of a binary model. Before plotting the ROC
curve, it is necessary to calculate the true positive rate (TPR) and
false positive rate (FPR). Below, are confusion matrices that were used
to calculate those values.
#ROC and AUC
df_trans$diabetes <- as.factor(df_trans$diabetes)
# fit a logistic
model.logit <- glm(diabetes ~ age + gender+ bmi+hypertension + heart_disease + blood_glucose_level+ HbA1c_level , family = binomial, data = df_trans)
# predict probability of P(Y = "Yes")
probabilities <- round(as.vector(predict(model.logit, type = "response")),3)
#
thresholds <- c(0.0, 0.25, 0.5, 0.75, 1.0)
# Loop through thresholds and create confusion matrices
for (threshold in thresholds) {
cat("\nConfusion Matrix for Threshold =", threshold, "\n")
# Convert probabilities to predictions
# am: 1 = manual transmission, 0 = automatic transmission
predictions <- ifelse(probabilities > threshold, "Yes", "No")
all_levels <- union(levels(factor(df_trans$diabetes)), levels(factor(predictions)))
# Generate confusion matrix
cm <- confusionMatrix(factor(predictions, levels=all_levels), factor(df_trans$diabetes,levels=all_levels), positive = "Yes")
print(cm$table)
}
Confusion Matrix for Threshold = 0
Reference
Prediction No Yes
No 354 0
Yes 8146 91500
Confusion Matrix for Threshold = 0.25
Reference
Prediction No Yes
No 4216 146
Yes 4284 91354
Confusion Matrix for Threshold = 0.5
Reference
Prediction No Yes
No 5307 867
Yes 3193 90633
Confusion Matrix for Threshold = 0.75
Reference
Prediction No Yes
No 6322 3340
Yes 2178 88160
Confusion Matrix for Threshold = 1
Reference
Prediction No Yes
No 8500 91500
Yes 0 0
ROC Curve
#create ROC curve
TPR = c(1,91500/(91500+0), 91356/(91356+144), 90632/(90632+868), 88160/(88160+3340), 0/(91500+0))
FPR = c(1,352/(352+8148), 4282/(4282+4218), 3193/(5307+3193), 2178/(2178+6322), 0/(8500+0))
plot(FPR, TPR, type = "b", main = "An Illustrative ROC Curve", col ="blue",
xlab="1 - Specifity (FPR)", ylab = "Sensitivity (TPR)")
# add a off-diagonal representing random guess algorithm in binary prediction
abline(0,1, lty = 2, col = "red")
# legend
legend("bottomright", c("Logistic Model", "Random Guess"),
col=c("blue", "red"), lty = 1:2, bty="n", cex = 0.9)
AUC (Area under the Curve) AUC quantifies the
performance of the ROC curve into a single value that ranges from 0 to
1. The plot below uses the Riemann Sum approximation to estimate the
area under the curve.
#AUC
TPR = round(c(1,91500/(91500+0), 91356/(91356+144), 90632/(90632+868), 88160/(88160+3340), 0/(91500+0)),3)
FPR = round(c(1,352/(352+8148), 4282/(4282+4218), 3193/(5307+3193), 2178/(2178+6322), 0/(8500+0)),3)
TPR0 = TPR[7:1]
FPR0 = FPR[7:1]
#plot(FPR0, TPR0, type = "b")
datSenSpe = data.frame(TPR0, FPR0)
ggROC = ggplot(data = datSenSpe, aes(x = FPR0, y=TPR0)) +
geom_line(col = "steelblue") +
geom_point(col = "red") +
geom_segment(x = FPR0, y = 0, xend = FPR0, yend = TPR0, color = 4) +
geom_segment(x = 0, y = 0, xend = FPR0[7], yend = 0, color = 6) +
ggtitle("Approximating the AUC of Logistic Model") +
xlab("1-specificity (FPR)") +
ylab("Sensitivity (TPR)") +
annotate("text", x = 0.025, y = 0.125, label= "A") +
annotate("text", x = 0.105, y = 0.5, label = "B") +
annotate("text", x = 0.605, y = 0.5, label = "C") +
theme(plot.title = element_text(hjust = 0.5),
legend.position = c(0.8, 0.2),
plot.margin = unit(c(0.15, 0.15, 0.75, 0.15), "inches"),
axis.line = element_line(size = 2, colour = "navy", linetype=1))
# partition the region under the ROC into trapezoids
ggplotly(ggROC)
AUC Value
A<-(0.041*1)/2
B<- ((0.504-0.041)*1)
C<- ((1-0.504)*1)
AUC<- A+B+C
kable(AUC, caption= "Area under the Curve")
Area under the Curve
| 0.9795 |
Conclusion for Model Fit
Our best fit model is : \[
\text{ Diabetes} = 9.99 - 0.27\times \text{gender} - 1.03\times
\text{age} -0.8\times \text{hypertension} - 0.76\times \text{heart
disease} -10.24\times \text{bmi} - 2.49\times \text{HbA1c level} -
-10.3\times \text{blood glucose level}
\]
The best fit model was selected based on a variety of different
features. Both stepwise selection and random forest selected the model
as best fit. After running performance analyses on the model, the model
was shown to be of high performance. This means that the variables
chosen for the final model are significant for predicting diabetes. The
model chose not to select smoking history as significant predictor for
the response variable. A possible suggestion for this would be to reduce
the number of response values to a binary response. Such as, “Have you
ever smoked?” and have the response be “Yes” or “No”. This could help in
predicting the relationship between smoking and diabetes.
---
title: "Applied Machine Learning for Data Analysis"
author: "Gabriella Khalil"
date: "2025-02-18"
output: 
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: no
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: no
    fig_width: 3
    fig_height: 3
editor_options: 
  chunk_output_type: inline
---


```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 22px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 18px;
  font-weight: bold;
  font-family: system-ui;
  color: navy;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
  font-weight: bold;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: center;
    font-weight: bold;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
    font-weight: bold;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("tidyverse")) {
   install.packages("tidyverse")
library(tidyverse)
}
if (!require("GGally")) {
   install.packages("GGally")
library(GGally)
}
knitr::opts_chunk$set(echo = TRUE,   # include code chunk in the output file
                      warning = FALSE,# sometimes, you code may produce warning messages,
                                      # you can choose to include the warning messages in
                                      # the output file. 
                      results = TRUE, # you can also decide whether to include the output
                                      # in the output file.
                      message = FALSE,
                      comment = NA
                      )  
library(tidyverse)
library(VIM) 
library(dplyr)
library(Hmisc)
library(ggplot2)
library(plotly)
library(DescTools)
library(cowplot)
library(mice)
library(moments)
library(tidyr)
library(reshape2)  # For data reshaping
library(caret)
library(pROC)
library(randomForest)
library(fastDummies)
```
# Overview of Data

The purpose of this dataset was to help find predictors of diabetes. The goal is to help medical professionals identify patients with potential risk factors of diabetes.The dataset contains information on patient demographics such as age and gender, as well as medical information including blood glucose levels and hypertension. The observations in this dataset were obtained from research studies and healthcare institutions. The dataset was obtained via Kaggle. 

Mustafa, T. Z. (2021). *Diabetes Prediction Dataset*. Kaggle. https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset

The dataset consists of 8 predictor variables for diabetes:
'gender' the patient's gender as male or female.
'age' the age of the patient in years.
'hypertension' yes or no, on whether the patient has hypertension.
'heart disease' yes or no, on whether the patient has heart disease.
'smoking history' the patient's smoking history indicated as never, no info, current, former or not current.
'bmi' the patient's body mass index (kg/m**2).
'HbA1c level' the patient's average blood sugar level over the past 2-3 months (%).
'blood glucose level' amount of blood sugar in the patient's blood (mg/dL).



```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The plot of shows the frequency of patients with diabetes and hypertension. The plot indiactes a higher frequency of patients not having diabetes and not having hypertension. There does not seem to be an interaction between having diabetes and having hypertension based on the low frequency of patients having both. '}
#load in dataset
data <-read.csv('https://raw.githubusercontent.com/GabbyK8/Datasets/refs/heads/main/diabetes_prediction_dataset.csv')
diabetesd<-data

#change heart disease, diabetes and hypertension to 1 and 0 for easier use in MICE.
data$diabetes <- recode(data$diabetes, "1"="Yes", "0"="No")
data$hypertension <- recode(data$hypertension, "1"="Yes", "0"="No")
data$heart_disease <- recode(data$heart_disease, "1"="Yes", "0"="No")

#plot interaction between diabetes and hypertension.
ggplot(data, aes(x = hypertension, fill = diabetes)) +
  geom_bar(position = "dodge") + # Use "stack" for a stacked bar chart
  labs (title="Risk of Diabetes by Hypertension", x="Hypertension", y= "Frequency",
        fill= "Diabetes")

```

# Comparison of Two Categorical Variables

The purpose of this graph is to show the relationship between having hypertension and diabetes. For this graph, I changed the values of 1 in both the diabetes and hypertension variable to "Yes" and the value 0 to "No." These integers were meant to represent categorical variables with 1 correlating to being positive for hypertension or diabetes and 0 correlating to being negative to hypertension or diabetes. Based on our graph, we can see that there is no correlation between having diabetes and having hypertension, with most patients within our sample having neither diabetes or hypertension. Thus, hypertension would not be a good predictor of diabetes within patients.
```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The plot shows the interaction between blood glucose levels and HbA1c levels on diabetes. Based on the plot, higher blood glucose levels and HbA1c levels are positively correlated with the presence of diabetes. In comparison, lower levels, indicate that a patient will more commonly not have diabetes. The height of the points shown for diabetes represent the levels of HbA1c, based on the plot, HbA1c levels predict the presence of diabetes more often than blood glucose levels. '}


interact = ggplot(data, aes(x =blood_glucose_level , y = HbA1c_level, group = diabetes, color = diabetes)) +
  geom_line(size = 1) +  # Lines representing interaction
  geom_point(size = 2) + # Points for data
  labs(
    title = "Interaction of Blood Glucose and HbA1c Levels",
    x = "Blood Glucose Level",
    y = "HbA1c Level",
    color = "Presence of Diabetes"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.text = element_text(size = 12),
    legend.position = "top"
  )
interact

```
# Comparison of Two Numerical Variables

The two variables being compared here are Blood Glucose Level and HbA1c Level. HbA1cc is a shortened term for hemoglobin A1c. Hb1Ac level is a measure of average blood sugar levels over 2-3 months and is measured in %. Blood glucose level is the amount of blood sugar in a patients blood at a given time. Our plot shows a correlation between high levels in HbA1c and blood glucose levels and it's relationship with diabetes. This plot suggests patients with high levels of HbA1c and blood glucose levels are at high risk for diabetes.

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The boxplot shows the relationship between age and presence of diabetes. The boxplot shows that older patients on average are more often have diabetes. There are a few outliers in the diabetes population that appear to be under the age of 20 years. This could be representative of patients with Type I diabetes, since the data does not distinguish between Type I and Type II diabetes. '}

#plot of a numerical and categorical variable
 boxplot(data$age ~ data$diabetes,
         xlab="Patient Age",
         ylab="Presence of Diabetes",
         col = c("skyblue", "purple"),
         main="Visualization of Diabetes by Age",
         cex.main = 1.1,
         col.main = "navy")
```

# Comparison of a Numerical and Categorical Variables

This chart shows the relationship between age and diabetes diagnosis. This chart shows that as patients grow older they are at higher risk for diabetes. This would make age a predictor for diabetes, with the average age for those without diabetes being 40 years and 60 years for those with diabetes. However, we can see that we do have a few outliers for those with diabetes with some patients being diagnosed at 20 or younger. This could be from the population of people with Type I diabetes which is usually diagnosed in patients under 20. Our dataset does not distinguish between type I and type II diabetes.


# Conclusion for Comparison of Variables

Diabetes can be predicted using age, blood glucose levels, and HbA1c levels in patients. However, our results showed hypertension is not a good predictor of diabetes. Based on our findings, it would be interesting to look at factors that occur once a patient becomes older that could lead to diabetes at an older age. As well, diet could be another factor to look into, such as grams of sugar consumed on a daily or weekly basis, to see if sugar consumption can predict diabetes. Another important thing to analyze in future studies would be Type I versus Type II diabetes, as it could be the reason for some of the outliers in the boxplot. 


## Overview of Missing Variables

An important step in data analysis for machine learning is handling missing values. Missing values in a dataset can occur for many reasons, but not limited to non-response in surveys, participants leaving the study, data entry errors and system limitations. If missing values are not addressed, models can become biased and accuracy of the anaylysis can decrease. This section examines different imputation strategies to effectively handle missing values.

The imputation methods we will examine include:

Replacement Imputation for Categorical Features: Mode imputation and KNN-based imputation.

Regression-based Imputation for Numerical Features: Predictive modeling to estimate missing values.

Multiple Imputation: Advanced methods such as MICE to improve robustness.

```{r}
# create random observation ID and replace the corresponding obs with missing
ltr.missing.id <- sample(1:1000, 50, replace = FALSE)
gpa.missing.id <- sample(1:1000, 15, replace = FALSE) 
diabetesd$bmi[ltr.missing.id] <- NA
diabetesd$heart_disease[gpa.missing.id] <- NA
diabetesd$smoking_history[diabetesd$smoking_history == "No Info"] <- NA

```

# Mode Imputation for Categorical Variables

In this method, missing variables are replaced with the most frequent category of the corresponding variable. Mode imputation ensures consistency and works well when missing values are randomly distributed. Below, mode imputation is used on the variable, heart disease, which is 1 if the patient has heart disease and 0 if the patient does not have heart disease.The variable smoking history had current, former, no info, never, and ever as possible values. No Info indicates that we do not have information on the participant's smoking history and will be treated as a missing values.

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The plot shows the comparison between the original data containing missing values for the variables heart disease and smoking history, and the data after using mode impute. The histograms show that mode impute used what was most common to replace the missing values. This is seen clearly in the variable smoking history, where "never" went from originally a count of about 40,000 to about 70,000. Heart disease had a small amount of missing data (only 15), however the 15 missing were changed to 0, indicating the patient would not have heart disease.'}


diabetesd <-diabetesd
# Function to impute mode

Mode <- function(x) {
  ux <- unique(na.omit(x))  # Remove NAs before computing mode
  tab <- tabulate(match(x, ux))
  mode_value <- ux[which.max(tab)]
  
  # Ensure mode_value is of the same type as x
  if (is.factor(x)) {
    return(factor(mode_value, levels = levels(x)))
  } else if (is.character(x)) {
    return(as.character(mode_value))
  } else {
    return(as.numeric(mode_value))
  }
}
#apply mode imputation
value_imputed <- data.frame(
  original = diabetesd$heart_disease,
  original2=diabetesd$smoking_history,
  imputed_modehd = replace(diabetesd$heart_disease, is.na(diabetesd$heart_disease), Mode((diabetesd$heart_disease))),
  imputed_modesh = replace(diabetesd$smoking_history, is.na(diabetesd$smoking_history), Mode(diabetesd$smoking_history)))
summary(value_imputed)

#plot the comparison of original data and mode imputed data
h1 <- ggplot(value_imputed, aes(x = original)) +
  geom_histogram(bins=10, fill = "#ad1538", color = "#000000", position = "identity") +
  ggtitle("Original for Heart Disease") +
  theme_classic()
h2 <- ggplot(value_imputed, aes(x = original2)) +
  geom_bar( fill = "#15ad4f", color = "#000000", position = "identity") +
  ggtitle("Original for Smoking History") +
  theme_classic()
h3 <- ggplot(value_imputed, aes(x = imputed_modehd)) +
  geom_histogram(bins=10, fill = "#1543ad", color = "#000000", position = "identity") +
  ggtitle("Mode-imputed for Heart Disease") +
  theme_classic()
h4 <- ggplot(value_imputed, aes(x = imputed_modesh)) +
  geom_bar( fill = "#ad8415", color = "#000000", position = "identity") +
  ggtitle("Mode-imputed for Smoking History") +
  theme_classic()


plot_grid(h1, h2, h3, h4, nrow = 2, ncol = 2)



```
# Regression-Based Imputation for Numerical Variables

Using regression models, we can estimate missing values. For the variable, bmi, we can predict the missing values using estimates from our model. Bmi is a continuous variable, unlike the previous example where our values could only be 0 or 1. In this method, a linear model was created to help create estimates of our values. 

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The boxplots show the comparison of the variable bmi before and after regression imputation.' }

#create a dataset for regression impute
regimpute<-diabetesd

# Identify rows where BMI is missing
missing_rows <- which(is.na(diabetesd$bmi))  # Get row indices


# Train a linear model using complete cases
lm_model <- lm(bmi ~ age + blood_glucose_level + HbA1c_level + heart_disease + hypertension + diabetes, 
               data = regimpute, na.action = na.exclude)

# Impute missing BMI values using the model
diabetesd$bmi[missing_rows] <- predict(lm_model, newdata = regimpute[missing_rows, ])


dep.mi <- bind_rows(
  data.frame(value = diabetesd$bmi, imputed = regimpute$bmi))
             #imp = "dep.imp1")


i7<-boxplot(diabetesd$bmi,
main = "Mean bmi in Original Data",
xlab = "BMI",
col = "blue",
border = "black",
horizontal = TRUE,
notch = TRUE
)
i8 <-boxplot(regimpute$bmi,
main = "Regression Imputation for BMI",
xlab = "BMI",
col = "red",
border = "black",
horizontal = TRUE,
notch = TRUE
)


 

```

# MICE Imputation

MICE is multiple imputation techniques used to impute missing values. This method also uses regression models and accounts for uncertainty by generating multiple possible values. Using MICE, the variables smoking history, heart disease, and bmi, can be imputated within one function. MICE can handle different types of variables, imputates based on the relationship between variables and it accounts for variability in missing data. 
```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='Based on the three graphs, MICE seems to have produced values that appear almost identical to the original dataset. Mode impute seems to be slightly different since it relies on filling missing values with the most common value. In this case, mode impute filled all missing values as 0 and under-represented 1.'}

#reload dataset so that it is free of mode and regression imputation done previously
data2 <- read.csv('https://raw.githubusercontent.com/GabbyK8/Datasets/refs/heads/main/diabetes_prediction_dataset.csv')

#create missing values and make smoking history a binary variable
ltr.missing.id <- sample(1:1000, 50, replace = FALSE)
gpa.missing.id <- sample(1:1000, 15, replace = FALSE) 
data2$smoking_history[data2$smoking_history== "No Info"] <-NA
data2$bmi[ltr.missing.id] <- NA
data2$heart_disease[gpa.missing.id] <- NA

#make variables numeric
data2$gender <- as.numeric(as.factor(data2$gender))
data2$heart_disease<- as.factor(as.numeric(data2$heart_disease))
data2$smoking_history<-as.factor(data2$smoking_history)
#apply MICE imputation
df_mice <- mice(data2, method = c("","","","pmm","polyreg","pmm","","",""),  m = 20, maxit=20,seed = 123, print=F)

# Complete dataset with imputed values
df_imputed <- complete(df_mice)

#plot comparisons of before and after MICE imputation for heart disease
g1 <- ggplot(value_imputed, aes(x = imputed_modehd)) +
  geom_bar(fill = "#1543ad", color = "#000000", position = "identity") +
  ggtitle("Mode-imputed") +
  labs(x="Heart Disease")+
  theme_classic()
g2 <- ggplot(value_imputed, aes(x = original)) +
  geom_bar(fill = "#ad1538", color = "#000000", position = "identity") +
  ggtitle("Original") +
  labs(x="Heart Disease")+
  theme_classic()
g3 <- ggplot(df_imputed, aes(x = heart_disease)) +
  geom_bar(fill = "purple", color = "#000000", position = "identity") +
  ggtitle("MICE") +
  labs(x="Heart Disease")+
  theme_classic()
plot_grid(g1, g2, g3, nrow=1, ncol =3)
```

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The bar graphs show the comparison of the original data to mode imputation and MICE. MICE appears to have dispersed the data more evenly across the different categories, while mode imputation has only added values to "never.'}
#plot comparisons for smoking history
g4 <- ggplot(df_imputed, aes(x = smoking_history)) +
  geom_bar( fill = "purple", color = "#000000", position = "identity") +
  ggtitle("MICE") +
  labs(x="Smoking History")+
  theme_classic()
g5 <- ggplot(value_imputed, aes(x = original2)) +
  geom_bar( fill = "red", color = "#000000", position = "identity") +
  ggtitle("Original Smoking History") +
  labs(x="Smoking History")+
  theme_classic()
g6 <- ggplot(value_imputed, aes(x = imputed_modesh)) +
  geom_bar( fill = "blue", color = "#000000", position = "identity") +
  ggtitle("Mode-imputed") +
  labs(x="Smoking History")+
  theme_classic()
plot_grid(g4,g5,g6, nrow=3, ncol=1)
```
```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The boxplots show the comparison of the original data to both MICE and regression imputation.'}

#plot comparisons for bmi
g7<-boxplot(diabetesd$bmi,
main = "Mean bmi in Original Data",
xlab = "BMI",
col = "blue",
border = "black",
horizontal = TRUE,
notch = TRUE
)
g8 <-boxplot(regimpute$bmi,
main = "Regression Imputation for BMI",
xlab = "BMI",
col = "red",
border = "black",
horizontal = TRUE,
notch = TRUE
)
g9 <-boxplot(df_imputed$bmi,
main = "MICE for BMI",
xlab = "BMI",
col = "purple",
border = "black",
horizontal = TRUE,
notch = TRUE
)



```

```{r fig.align='center', fig.width=7, fig.height=5}
model5 <- with(df_mice, lm(diabetes ~ bmi+ heart_disease+ smoking_history ))
summary(pool(model5))
```
# Skewness

Skewness is a measure of symmetry within a distribution. The skewness measurements below are of the variables age, bmi, HbA1c level, and blood glucose level, respectively. When skewness is equal to 0, we can assume the data is normally distributed. When skewness is greater than 0, our data is positively skewed, while a value less than zero indicates negatively skewed data. For the diabetes dataset, the variables bmi and blood glucose level are 1 or about 1, which indicate a significant right skew in the distribution. 
```{r}



age1<- skewness(df_imputed$age) 
bmi1<-skewness(df_imputed$bmi) 
HbA1c1<-skewness( df_imputed$HbA1c_level) 
glucose1<-skewness(df_imputed$blood_glucose_level)

mytable<-data.frame(age1,bmi1,HbA1c1, glucose1)
kable(mytable, caption= "Skewness of Numeric Variables")

```
# Transforming Data

To handle our skewed variables, we can normalize the variables bmi and blood glucose level. Normalization rescales data into a fixed range of [0,1]. By normalizing our data, we can fix skewness and create normally distributed data. The variables, age and HbA1c level have been standardized to keep the distribution centered. Standardization is used to create a mean of 0 and a standard deviation of 1, to follow a normal distribution. Based on the results below, we can see that the transformation has corrected the right skew in the data from before.


```{r}
df_trans <- df_imputed  # Copy original dataset after MICE

#ensure diabetes is numeric and a factor
df_trans$diabetes<- as.numeric(as.factor(df_trans$diabetes))
# Min-Max Normalization
normalize <- function(x) {
  if (max(x) - min(x) == 0) return(rep(0, length(x)))  # Prevent division by zero
  return((x - min(x)) / (max(x) - min(x)))
}

#apply normalization
df_trans$bmi <- normalize(df_imputed$bmi)
df_trans$blood_glucose_level <- normalize(df_imputed$blood_glucose_level)

# Z-Score Standardization
standardize <- function(x) {
  return((x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))
}

#apply standardization
df_trans$age <- standardize(df_trans$age)
df_trans$HbA1c_level <- standardize(df_trans$HbA1c_level)

# Apply Log Transformation to Fix Skewness (if necessary)
df_trans$bmi <- log1p(df_trans$bmi)  # log(1 + x) avoids log(0) issues
df_trans$blood_glucose_level <- log1p(df_trans$blood_glucose_level)

bmi2 <- skewness(df_trans$bmi)
age2 <- skewness(df_trans$age)
HbA1c2<- skewness(df_trans$HbA1c_level)
glucose2<- skewness(df_trans$blood_glucose_level)

newtable<-data.frame(bmi2,age2,HbA1c2,glucose2)
kable(newtable, caption="Newly Transformed Data")


```
Transforming our data has successfully decreased skewness in our variables. We can see bmi and blood glucose level are no longer about 1. 

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='We can see in the graphs above that after transforming the dataset, the data seems to follow a normal distribution more closely.This is apparent in the histograms appearing more in a bell shape curve.' }

# Combine old and new data for comparison
df_compare <- data.frame(
  bmi_original = df_imputed$bmi,
  bmi_transformed = df_trans$bmi,
  blood_glucose_original = df_imputed$blood_glucose_level,
  blood_glucose_transformed = df_trans$blood_glucose_level
)

# Reshape data for ggplot
df_long <- melt(df_compare)

# Plot histograms
ggplot(df_long, aes(x = value, fill = variable)) +
  geom_histogram(alpha = 0.6, bins = 30, position = "identity") +
  facet_wrap(~ variable, scales = "free") +
  theme_minimal() +
  labs(title = "Comparison of Original vs. Transformed Data", x = "Value", y = "Count")


```




# Analyzing Data
The next step is to find which variables are significantly important when trying to predict diabetes. By using feature selection, we are able to predict possible best fit models for our variable, diabetes. For this data, we will focus on logistic regression. The variables of interest, diabetes, is a binary variable indicating linear regression would not make an ideal model. 



**Random Forest** 
First, we will look at the best model selected by feature selection through random forest:
```{r fig.align='center', fig.width=7, fig.height=5,fig.cap='The table shows the selected variables random forest has selected to use for our model.'}

# Prepare the data
df_trans$diabetes <- as.factor(df_trans$diabetes)


# Set up RFE control with random forest
control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)

# Apply RFE to identify top features
results <- rfe(
  x = df_trans[, !colnames(df_trans) %in% "diabetes"],
  y = df_trans$diabetes,
  sizes = c(1:9),
  rfeControl = control
)

# Print the selected features
models<-data.frame(predictors(results))
kable(models, caption="Random Forest Selected Model")


```
**Stepwise Logistic Regression**
Next, let's look and stepwise logistic regression

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The coefficients shown below are the ones stepwise logistic regression has chosen for the final model'}
#Stepwise Logistic Regression
# Fit a full logistic regression model with all predictors
full_model <- glm(diabetes ~ ., data = df_trans, family = binomial)

# Perform stepwise selection (default is backward elimination)
step_model1 <- step(full_model, direction = "both")
summary(step_model1)
```

Both selection types selected the same model! However I will still go through the steps of cross-validation as though I was comparing the two models.

# Cross-Validation

Let's explore cross-validation to see which model is best fit. For logistic regression, it is meaningful to look at accuracy and kappa. 
```{r fig.align='center', fig.width=7, fig.height=5}

#add classification
df_trans$diabetes<- factor(ifelse(df_trans$diabetes==1, "Yes","No"))

train_control <- trainControl(
  method = "cv",         # cross-validation
  number = 10,           # 10-fold
  classProbs = TRUE # use if you're tracking AUC, Sensitivity, etc.
)

#run CV model
cv_model <- train(
  diabetes ~ hypertension + gender +age + HbA1c_level + blood_glucose_level + heart_disease + bmi, 
  data=df_trans, 
  method = "glm", 
  family = "binomial", 
  trControl = train_control
  )

print(cv_model)

```
Accuracy shows the overall correctness of the model predictions and kappa measures the levels of agreement between variables. Values closer to 1 suggest higher accuracy and better agreement. This model suggests high accuracy and high agreement based on the values.

# ROC Curve and AOC Curve

**Confusion Matrices**
ROC is also known as Receiver Operating Characteristic Analysis. It is a technique using graphs that evaluates the performance of a binary model. Before plotting the ROC curve, it is necessary to calculate the true positive rate (TPR) and false positive rate (FPR). Below, are confusion matrices that were used to calculate those values.
```{r fig.align='center', fig.width=7, fig.height=5,'These matrices are used to calculate TPR and FPR. These values are used to create the ROC curve.'}

#ROC and AUC
df_trans$diabetes <- as.factor(df_trans$diabetes)

# fit a logistic
model.logit <- glm(diabetes ~ age + gender+ bmi+hypertension + heart_disease + blood_glucose_level+ HbA1c_level , family = binomial, data = df_trans)
# predict probability of P(Y = "Yes")
probabilities <- round(as.vector(predict(model.logit, type = "response")),3)
#
thresholds <- c(0.0, 0.25, 0.5, 0.75, 1.0)

# Loop through thresholds and create confusion matrices
for (threshold in thresholds) {
  cat("\nConfusion Matrix for Threshold =", threshold, "\n")
  
  # Convert probabilities to predictions
  # am: 1 = manual transmission, 0 = automatic transmission
  predictions <- ifelse(probabilities > threshold, "Yes", "No")
    all_levels <- union(levels(factor(df_trans$diabetes)), levels(factor(predictions)))
    # Generate confusion matrix
  cm <- confusionMatrix(factor(predictions, levels=all_levels), factor(df_trans$diabetes,levels=all_levels), positive = "Yes")
  print(cm$table)
}

```
**ROC Curve**
```{r fig.align='center', fig.width=7, fig.height=5,'The plot shows the path of the ROC curve.'}

#create ROC curve

TPR = c(1,91500/(91500+0), 91356/(91356+144), 90632/(90632+868), 88160/(88160+3340), 0/(91500+0))
FPR = c(1,352/(352+8148), 4282/(4282+4218), 3193/(5307+3193), 2178/(2178+6322), 0/(8500+0))
plot(FPR, TPR, type = "b", main = "An Illustrative ROC Curve", col ="blue",
     xlab="1 - Specifity (FPR)", ylab = "Sensitivity (TPR)")
# add a off-diagonal representing random guess algorithm in binary prediction
abline(0,1, lty = 2, col = "red")
# legend
legend("bottomright", c("Logistic Model", "Random Guess"),
       col=c("blue", "red"), lty = 1:2, bty="n", cex = 0.9)
```
**AUC (Area under the Curve)**
AUC quantifies the performance of the ROC curve into a single value that ranges from 0 to 1. The plot below uses the Riemann Sum approximation to estimate the area under the curve.
```{r fig.align='center', fig.width=7, fig.height=5,'The plot estimates the area of the ROC Curve. The region is divided into subregions, A,B, and C. These values are used to calculate the area under the curve.' }
#AUC
TPR = round(c(1,91500/(91500+0), 91356/(91356+144), 90632/(90632+868), 88160/(88160+3340), 0/(91500+0)),3)
FPR = round(c(1,352/(352+8148), 4282/(4282+4218), 3193/(5307+3193), 2178/(2178+6322), 0/(8500+0)),3)
TPR0 = TPR[7:1]
FPR0 = FPR[7:1]
#plot(FPR0, TPR0, type = "b")
datSenSpe = data.frame(TPR0, FPR0)
ggROC = ggplot(data = datSenSpe, aes(x = FPR0, y=TPR0)) +
        geom_line(col = "steelblue") +
        geom_point(col = "red") +
        geom_segment(x = FPR0, y = 0, xend = FPR0, yend = TPR0, color = 4) +
        geom_segment(x = 0, y = 0, xend = FPR0[7], yend = 0, color = 6) +
        ggtitle("Approximating the AUC of Logistic Model") +
        xlab("1-specificity (FPR)") + 
        ylab("Sensitivity (TPR)") +
        annotate("text", x = 0.025, y = 0.125, label= "A") + 
        annotate("text", x = 0.105, y = 0.5, label = "B") +
        annotate("text", x = 0.605, y = 0.5, label = "C") +
        theme(plot.title = element_text(hjust = 0.5),
              legend.position = c(0.8, 0.2),
              plot.margin = unit(c(0.15, 0.15, 0.75, 0.15), "inches"),
              axis.line = element_line(size = 2, colour = "navy", linetype=1))
# partition the region under the ROC into trapezoids
ggplotly(ggROC)
```

**AUC Value**
```{r fig.align='center', fig.width=7, fig.height=5,'The area under the curve is 0.9795 which is close to 1, this indicates an ideal model.'}
A<-(0.041*1)/2
B<- ((0.504-0.041)*1)
C<- ((1-0.504)*1)

AUC<- A+B+C
kable(AUC, caption= "Area under the Curve")

```
# Conclusion for Model Fit
Our best fit model is :
$$
\text{ Diabetes} = 9.99 - 0.27\times \text{gender} - 1.03\times \text{age}  -0.8\times \text{hypertension} - 0.76\times \text{heart disease} -10.24\times \text{bmi} - 2.49\times \text{HbA1c level} - -10.3\times \text{blood glucose level}
$$


The best fit model was selected based on a variety of different features. Both stepwise selection and random forest selected the model as best fit. After running performance analyses on the model, the model was shown to be of high performance. This means that the variables chosen for the final model are significant for predicting diabetes. The model chose not to select smoking history as significant predictor for the response variable. A possible suggestion for this would be to reduce the number of response values to a binary response. Such as, "Have you ever smoked?" and have the response be "Yes" or "No". This could help in predicting the relationship between smoking and diabetes. 
